Explore the 1000 Genomes Sample Information

This notebook demonstrates exploring the sample information for the 1000 Genomes dataset stored in BigQuery. Specifically you will:

  • Use the %%sql statement to write and execute SQL statements within the notebook
  • Extract data from BigQuery and create a local dataset that can be manipulated in Python
  • Refine and pivot your local dataset via the Pandas Python library
  • Visualize different aspects of your dataset via the Matplotlib and Seaborn Python libraries

NOTE:

  • If you're new to notebooks, or want to see additional samples, check out the full list of general notebooks.
  • For additional Genomics samples, check out the full list of Genomics notebooks.

Working with data in BigQuery

First let's take a look at the 1000 Genomes sample dataset table.


In [1]:
import gcp.bigquery as bq
samples_table = bq.Table('genomics-public-data:1000_genomes.sample_info')
samples_table.schema


Out[1]:
(table schema output listing the sample annotation columns and their types)

We can see in the table schema that a number of different annotations exist for each genomic sample.

To get a feel for the dataset, let's see how the samples are distributed across populations and super populations.


In [2]:
%%sql --module pops
SELECT
  population,
  population_description,
  super_population,
  super_population_description,
  COUNT(population) AS population_count
FROM
  $samples_table
GROUP BY
  population,
  population_description,
  super_population,
  super_population_description

In [3]:
bq.Query(pops, samples_table=samples_table).to_dataframe()


Out[3]:
population population_description super_population super_population_description population_count
0 GBR British in England and Scotland EUR European 107
1 FIN Finnish in Finland EUR European 105
2 CHS Southern Han Chinese, China EAS East Asian 171
3 PUR Puerto Rican in Puerto Rico AMR American 150
4 CDX Chinese Dai in Xishuangbanna, China EAS East Asian 109
5 CLM Colombian in Medellin, Colombia AMR American 148
6 IBS Iberian populations in Spain EUR European 162
7 PEL Peruvian in Lima, Peru AMR American 130
8 PJL Punjabi in Lahore,Pakistan SAS South Asian 158
9 KHV Kinh in Ho Chi Minh City, Vietnam EAS East Asian 124
10 ACB African Caribbean in Barbados AFR African 123
11 GWD Gambian in Western Division, The Gambia AFR African 180
12 ESN Esan in Nigeria AFR African 173
13 BEB Bengali in Bangladesh SAS South Asian 144
14 MSL Mende in Sierra Leone AFR African 128
15 STU Sri Lankan Tamil in the UK SAS South Asian 128
16 ITU Indian Telugu in the UK SAS South Asian 118
17 CEU Utah residents with Northern and Western Europ... EUR European 183
18 YRI Yoruba in Ibadan, Nigeria AFR African 186
19 CHB Han Chinese in Bejing, China EAS East Asian 108
20 JPT Japanese in Tokyo, Japan EAS East Asian 105
21 LWK Luhya in Webuye, Kenya AFR African 116
22 ASW African Ancestry in Southwest US AFR African 112
23 MXL Mexican Ancestry in Los Angeles, California AMR American 107
24 TSI Toscani in Italy EUR European 112
25 GIH Gujarati Indian in Houston,TX SAS South Asian 113

In order to further analyze our query results locally within the IPython notebook, let's convert the result set into a Pandas dataframe:


In [4]:
pops_df = bq.Query(pops, samples_table=samples_table).to_dataframe()
pops_df[:5]


Out[4]:
population population_description super_population super_population_description population_count
0 GBR British in England and Scotland EUR European 107
1 FIN Finnish in Finland EUR European 105
2 CHS Southern Han Chinese, China EAS East Asian 171
3 PUR Puerto Rican in Puerto Rico AMR American 150
4 CDX Chinese Dai in Xishuangbanna, China EAS East Asian 109

Pandas dataframes have a dataframe.plot() method that allows us to quickly render a number of standard visualizations. Let's draw a bar chart of the per-population sample counts:


In [5]:
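# seaborn is imported for its plot styling: seaborn versions of this era restyle matplotlib charts on import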
import seaborn as sns
pops_df.plot(kind='bar', y='population_count')


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3def0dc450>
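
The default bar chart labels each bar with its dataframe row index. For more readable labels, one small variation (a sketch using the same dataframe) is to plot the population codes on the x axis:

pops_df.plot(kind='bar', x='population', y='population_count')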

Now that we have a dataframe object, we can compute arbitrary rollups and aggregations locally. For example, let's aggregate the count of samples from the population level to the super population level. We'll do this via the dataframe.groupby operation, which is generally of the form:

dataframe.groupby("column name").aggregation_func()
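
For example, to roll the per-population counts up to the super population level while summing only the numeric count column (a minimal sketch; newer pandas versions no longer silently drop the string description columns, so selecting the column explicitly is the safer form):

pops_df.groupby('super_population')['population_count'].sum()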


In [6]:
superpop_df = pops_df.groupby('super_population').sum()
superpop_df.plot(kind='bar')


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3deef5bf10>

We see that the distribution of genome samples across population and super population is relatively uniform, with the exception of the AFR (African) super population, which contributes noticeably more samples.
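
To put numbers behind that observation, you can sort the aggregated counts (a quick sketch using the dataframe computed above):

superpop_df['population_count'].sort_values(ascending=False)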

Genome sample metrics

Let's explore a few quantitative attributes of the genomic samples. We'll keep our super population sample classification, but also add some metrics describing the extent to which each sample was sequenced, for both exomic and low-coverage regions. We'll further annotate our samples with a tag for the laboratory/center that performed the sequencing.


In [7]:
%%sql --module metrics
SELECT
  super_population,
  total_lc_sequence,  -- lc = low coverage
  total_exome_sequence,
  Main_Project_E_Centers
FROM
  $samples_table
WHERE
  total_exome_sequence IS NOT NULL
  AND total_lc_sequence IS NOT NULL
  AND Main_Project_E_Centers IS NOT NULL
  AND Main_Project_E_Centers != 'BCM,BGI'  -- exclude this single outlier

Again, we can convert these results to a Pandas dataframe for local analysis and visualization:


In [8]:
df = bq.Query(metrics, samples_table=samples_table).to_dataframe()
df[:10]


Out[8]:
super_population total_lc_sequence total_exome_sequence Main_Project_E_Centers
0 EUR 14714705600 9580861000 WUGSC
1 EUR 30285067857 5856213714 BCM
2 EUR 25134554475 8125011458 BCM
3 EUR 42906753668 15144863140 BI
4 EUR 22115533074 11754108800 WUGSC
5 EUR 21621189887 9003406100 WUGSC
6 EUR 16331033600 9145497300 WUGSC
7 EUR 22262720677 5851507215 BCM
8 EUR 24284331308 6640380441 BCM
9 EUR 23778576248 9595593000 WUGSC
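
Since we'll be faceting by sequencing center later, it's also worth a quick look at how many samples each center contributed (a sketch using the dataframe we just loaded):

df['Main_Project_E_Centers'].value_counts()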

To get a feel for the quantitative sample attributes, let's see how their values are distributed among the dataset samples by plotting histograms. Note that a histogram of any dataframe attribute can easily be rendered with the pattern dataframe.attribute_name.hist() (see the single-attribute sketch after the cell below).


In [9]:
df.hist(alpha=0.5, bins=30)


Out[9]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f3deee3af50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f3deec945d0>]], dtype=object)
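
As mentioned above, the same pattern works for a single attribute; for example (a sketch):

df.total_exome_sequence.hist(bins=30)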

What does the joint distribution of these two traits look like?


In [10]:
g = sns.jointplot('total_exome_sequence', 'total_lc_sequence', df)
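
If you'd like a numeric summary to go with the joint plot, you can compute the correlation matrix directly (a quick sketch):

df[['total_exome_sequence', 'total_lc_sequence']].corr()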


There is only a very slight positive correlation. Let's further annotate our scatter plot by coloring each point according to its super population assignment.


In [11]:
g = sns.lmplot('total_exome_sequence', 'total_lc_sequence', hue='super_population', data=df, fit_reg=False)


Now let's take the same plot as above, but facet our results by the genomic sequencing center that produced each sample, to look for inter-center variability in the dataset.


In [12]:
g = sns.lmplot('total_exome_sequence', 'total_lc_sequence', hue='super_population', col='Main_Project_E_Centers', col_wrap=2, data=df, fit_reg=False)


The WUGSC (The Genome Institute at Washington University) facet shows a small outlier cluster that is distinct relative to the other centers. The BCM (Baylor College of Medicine) facet appears the least variable along the exome sequencing dimension.

Are there any super population trends here? We can facet our data a second time, this time by super population to dig deeper.


In [13]:
g = sns.lmplot('total_exome_sequence', 'total_lc_sequence', hue='super_population', col='super_population', row='Main_Project_E_Centers', data=df, fit_reg=False)


With the data now faceted by super population as well as genomic center, the variability of the Broad Institute (BI) data becomes more apparent. Furthermore, the outlier cluster noted earlier in the WUGSC facet appears to consist primarily of samples from the EAS and AMR super populations, with no representation from the SAS super population.